# Replication Package for: Extractive versus Generative Language Models for Political Conflict Text Classification

**Author:** Shreyas Meher

**Affiliation:** Erasmus School of Social and Behavioural Sciences, Erasmus University Rotterdam

**Contact:** meher@essb.eur.nl

**Date:** July 18, 2025

## Overview

This replication package contains all data and code necessary to reproduce the 9 tables and 4 figures presented in the manuscript, "Extractive versus Generative Language Models for Political Conflict Text Classification."

The project is designed to be run via a single master script, `run.sh`, located in the root directory. It supports two modes of execution:

1.  **Verification (Default):** A fast process (under 2 minutes) that uses the provided, pre-computed model outputs to generate all tables and figures. This is the recommended mode for most users.
2.  **Full Recreation (Optional):** A computationally expensive process (from several hours to several days, depending on hardware) that re-creates all model predictions from scratch. **Warning:** This mode requires considerable computational resources, and running inference with the large language models (LLMs) may not be feasible on all consumer-grade GPUs.

All analysis is automated. To begin, please review the **Computational Requirements** section below and then execute the `run.sh` script.

---
## Data Availability and Provenance Statements

This paper uses a combination of public data, data generated by the authors for this study, and pre-computed model outputs to ensure replicability.

### Statement about Rights
I certify that the author(s) of the manuscript have legitimate access to and permission to use the data used in this manuscript.

I certify that the author(s) of the manuscript have documented permission to redistribute/publish the data contained within this replication package.

### Summary of Availability
[x] All data required to replicate the results presented in the paper **are** publicly available as part of this package.

### Details on Data Sources

| Data Name | Files in Archive | Location in Archive | Provided | Citation / Note |
| :--- | :--- | :--- | :--- | :--- |
| Global Terrorism Database | `globalterrorismdb_...xlsx` | `/data/raw/` | Yes | START (2022) |
| Author-compiled Corpora | `raw_bc_data.csv`, `raw_ner_conll_data.csv` | `/data/raw/` | Yes | Greene and Cunningham (2006); DSTL (2018) |
| Pre-computed Model Outputs | `*.csv` | `/data/model_outputs/` | Yes | Generated by this package |
| Raw Data for Tables 7-8 | `raw_mbert_comparison_data.csv` | `/data/raw/` | Yes | Meher and Brandt (2025) |


**1. Global Terrorism Database (GTD)**
- **Description:** The data used for the multi-label event classification analyses (Figures 1-4, Tables 5-8) is derived from the Global Terrorism Database (GTD). The full dataset can be obtained from the official source after a registration process.
- **Location:** [https://www.start.umd.edu/gtd/](https://www.start.umd.edu/gtd/)
- **Provided File:** The specific extract used for the analyses is provided in this package for convenience at `/data/raw/globalterrorismdb_0522dist.xlsx`.

**2. Binary Classification and NER Data**
- **Description:** The raw text data for the binary classification and Named Entity Recognition (NER) tasks were compiled by the authors for this study.
- **Provided Files:** `/data/raw/raw_bc_data.csv` and `/data/raw/raw_ner_conll_data.csv`.

**3. Pre-computed Model Outputs**
- **Description:** To facilitate a fast and simple verification of our results, we provide the complete output files from our model inference runs. The default `run.sh` command uses these files to generate all tables and figures. The `run.sh full` command will overwrite these files with newly generated results.
- **Provided Files:** All files located in the `/data/model_outputs/` directory.
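
Because `run.sh full` overwrites the provided outputs, it may be worth keeping a copy of `/data/model_outputs/` before a full recreation. The following is a minimal sketch, not part of the package: the function name and the backup directory name are hypothetical, and it assumes it is run from the project's root directory.

```python
import shutil
import time
from pathlib import Path


def backup_model_outputs(src="data/model_outputs", backup_root="data"):
    """Copy the pre-computed outputs to a timestamped sibling directory."""
    src_path = Path(src)
    if not src_path.is_dir():
        raise FileNotFoundError(f"Expected directory not found: {src_path}")
    stamp = time.strftime("%Y%m%d_%H%M%S")
    # Hypothetical naming scheme; any unused directory name would work.
    dest = Path(backup_root) / f"model_outputs_backup_{stamp}"
    shutil.copytree(src_path, dest)  # fails if dest already exists
    return dest
```

Calling `backup_model_outputs()` from the project root before `bash run.sh full` leaves the original verification files available for comparison with the newly generated ones.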

---
## Computational Requirements

This section details the software and hardware environment used to generate the results. The project's analyses were conducted across two distinct computational platforms.

### Software Requirements

- **Shell:** A `bash`-compatible shell is required to execute the master script (`run.sh`).
- **Python:** Version 3.12.4
  - All required packages are listed in `requirements.txt`. They can be installed via `pip install -r requirements.txt`.
- **R:** Version 4.4.1
  - Required packages: `jsonlite`, `data.table`, `ROCR`, `zoo`.
  - The `run.sh` script automatically checks for and installs these R packages.
- **Ollama:** (Required for `run.sh full` recreation run only)
  - **Description:** Ollama is the tool used to run the large language models (Llama 3.1, Gemma 2, etc.) locally.
  - **Installation:** Before attempting a `full` recreation, you must install Ollama.
    - **macOS & Windows:** Download and run the installer from the official website: [https://ollama.com/download](https://ollama.com/download)
    - **Linux:** Run the following command in your terminal:
      ```shell
      curl -fsSL https://ollama.com/install.sh | sh
      ```
  - **Post-installation:** After installing, ensure the Ollama application is running before executing the `run.sh full` command.
  - **Required Models:** The script will automatically check if you have the required models and will provide the `ollama pull` commands if they are missing.
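
Before starting a `full` run, it can help to confirm independently that Ollama is reachable and that the models are present. The sketch below is illustrative only and is not the check performed by `run.sh`. It assumes Ollama is listening on its default local address and queries the `/api/tags` endpoint, which lists locally available models. The model tags in `REQUIRED_MODELS` are hypothetical placeholders; use the names that `run.sh` reports.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"

# Hypothetical model tags; replace with the names printed by run.sh.
REQUIRED_MODELS = ["llama3.1", "gemma2"]


def missing_models(installed, required):
    """Return required models with no installed model of the same base name.

    Only the part before ':' is compared, so version tags are ignored.
    """
    installed_bases = {name.split(":", 1)[0] for name in installed}
    return [m for m in required if m.split(":", 1)[0] not in installed_bases]


def installed_models(url=OLLAMA_TAGS_URL, timeout=5):
    """Ask the local Ollama server which models are already pulled."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["name"] for entry in payload.get("models", [])]


def check_ollama(required=REQUIRED_MODELS):
    """Return a one-line status message instead of raising."""
    try:
        installed = installed_models()
    except OSError as exc:  # urllib.error.URLError is a subclass of OSError
        return f"Ollama does not appear to be running: {exc}"
    missing = missing_models(installed, required)
    if missing:
        return "Run: " + "; ".join(f"ollama pull {m}" for m in missing)
    return "All required models are available."
```

Running `print(check_ollama())` reports either a connection problem, the `ollama pull` commands still needed, or a confirmation that nothing is missing.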


### Hardware Requirements & Runtimes

#### **Environment 1: Local Machine (for BC and NER tasks)**
The Binary Classification and Named Entity Recognition tasks (Tables 2, 3, 9) were conducted on a **MacBook Pro** with the following specifications:
- **Processor:** Apple M2 Pro (12-core CPU, 19-core GPU)
- **Memory:** 16GB of unified memory
- **Operating System:** macOS Sequoia 15.5 (build 24F74)
- **Runtime (Full Recreation):** Approximately 1-3 days

#### **Environment 2: DeltaAI HPC Cluster (for Multi-Class Event tasks)**
The more computationally intensive multi-class event classification analyses (Figures 1-4, Tables 5-8) were conducted on the **DeltaAI** high-performance computing (HPC) cluster at the National Center for Supercomputing Applications (NCSA).
- **Node Type:** NVIDIA GH200 Grace Hopper superchip compute nodes
- **CPU/GPU:** 72-core ARM Grace CPU and an NVIDIA H100 GPU with 96GB HBM3 memory
- **Runtime (Full Recreation):** Approximately 2-8 hours

#### **Overall Runtimes**
- **Approximate time needed for Verification Run:** < 10 minutes
- **Approximate time needed for Full Recreation Run:** 1-3 days (highly dependent on hardware)
- **Approximate storage space needed:** < 250 MB

---
## Instructions to Replicators

The entire replication process is automated via the `run.sh` script.

### 1. Setup Environment (Recommended First Step)
It is highly recommended to use a Python virtual environment.
```bash
# Navigate to the project's root directory
# Create and activate a new virtual environment
python3 -m venv replication_env
source replication_env/bin/activate

# Install all Python dependencies
pip install -r requirements.txt
```
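
After activating the environment, a quick sanity check can confirm that the interpreter is the one inside the virtual environment and that its version is close to the documented one. This is a minimal sketch and not part of the package. Comparing only the major and minor version is an assumption made here; the package itself documents Python 3.12.4.

```python
import sys

EXPECTED = (3, 12)  # the package documents Python 3.12.4


def in_virtualenv():
    """True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def version_matches(version_info=sys.version_info, expected=EXPECTED):
    """Compare only major.minor; patch releases are assumed compatible."""
    return tuple(version_info[:2]) == tuple(expected)


def environment_report():
    """Summarize the interpreter version and virtual-environment status."""
    lines = [
        f"Python {sys.version.split()[0]} (expected {EXPECTED[0]}.{EXPECTED[1]}.x)",
        f"virtual environment active: {in_virtualenv()}",
    ]
    if not version_matches():
        lines.append("warning: interpreter differs from the documented version")
    return "\n".join(lines)


print(environment_report())
```

If the report shows that no virtual environment is active, re-run the `source replication_env/bin/activate` step before installing dependencies.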

### 2. Execute the Replication

From the project's root directory (with the virtual environment active), execute one of the following commands.

#### Verification Run (Default & Recommended)
This process is fast (under 2 minutes) and uses the pre-computed model outputs located in `/data/model_outputs/` to generate all figures and tables. It does not require a local Ollama setup.

```bash
# Make the master script executable (only needs to be done once)
chmod +x run.sh

# Run the verification
bash run.sh
```

#### Full Recreation Run (Optional)

This process runs the entire data pipeline from scratch, overwriting the files in `/data/model_outputs/`. It is computationally expensive (several hours to several days, depending on hardware) and requires a properly configured local Ollama instance with all models pulled.

```bash
bash run.sh full
```

Upon completion, a detailed log of the entire process will be saved to `run.log`. All generated tables and figures can be found in the `/tables/` and `/figures/` directories, respectively.

---

## Description of Programs and Files

This section provides an overview of the key files and scripts in the replication package.

### Directory Structure

```
/
|-- README.md                  # This documentation
|-- run.sh                     # Master script to execute the entire replication
|-- requirements.txt           # Python package dependencies
|-- /code/                     # All R and Python source code
|-- /data/                     # All data files
|   |-- /raw/                  # Original, unprocessed source data
|   |-- /model_outputs/        # Pre-computed model predictions for fast verification
|-- /figures/                  # Output directory for generated figures
|-- /tables/                   # Output directory for generated tables
|-- run.log                    # Log file generated by run.sh
```

### List of Programs and Generated Outputs

The following table maps each figure and table in the manuscript to the script that generates it and the corresponding output file.

| Figure/Table \# | Generating Script | Output File |
| :--- | :--- | :--- |
| Figures 1-4 | `code/01_generate_figures.R` | `figures/Figure_1.pdf`, etc. |
| Table 2 | `code/02_generate_tables.py` | `tables/table2.tex` |
| Table 3 | `code/02_generate_tables.py` | `tables/table3.tex` |
| Table 5 | `code/02_generate_tables.py` | `tables/table5.csv` |
| Table 6 | `code/02_generate_tables.py` | `tables/table6.csv` |
| Table 7 (App.)| `code/02_generate_tables.py` | `tables/table7.tex` |
| Table 8 (App.)| `code/02_generate_tables.py` | `tables/table8.csv` |
| Table 9 (App.)| `code/02_generate_tables.py` | `tables/table9.tex` |
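
Once `run.sh` has finished, the mapping above can be turned into a simple completeness check. The following is a minimal sketch and not part of the package. The table paths are taken from the mapping; the figure names other than `Figure_1.pdf` are an assumption that Figures 2-4 follow the same naming pattern.

```python
from pathlib import Path

# Table outputs exactly as listed in the mapping above.
EXPECTED_TABLES = [
    "tables/table2.tex", "tables/table3.tex", "tables/table5.csv",
    "tables/table6.csv", "tables/table7.tex", "tables/table8.csv",
    "tables/table9.tex",
]
# Assumption: Figures 2-4 follow the same naming pattern as Figure_1.pdf.
EXPECTED_FIGURES = [f"figures/Figure_{i}.pdf" for i in range(1, 5)]


def missing_outputs(root=".", expected=None):
    """Return the expected output files that are absent or empty."""
    if expected is None:
        expected = EXPECTED_TABLES + EXPECTED_FIGURES
    root = Path(root)
    missing = []
    for rel in expected:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

Calling `missing_outputs()` from the project root returns an empty list when every listed file exists and is non-empty; any returned paths point to outputs worth checking in `run.log`.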

### Description of Key Scripts

  - **`run.sh`**: The main entry point. Automates dependency checks and executes all analysis and/or data recreation scripts in the correct order.
  - **`code/01_generate_figures.R`**: Reads the merged multi-class model outputs and generates the four figures (ROC curves, P-R curves, etc.) presented in the paper.
  - **`code/02_generate_tables.py`**: A consolidated Python script that reads all necessary raw and pre-computed data files to programmatically generate all nine tables for the manuscript and appendix.
  - **`code/recreate_*.py`**: A series of Python scripts (`recreate_bc_inference.py`, `recreate_ner_...`, `05_recreate_multiclass_data.py`) used exclusively by the `run.sh full` command to reproduce all model predictions from the raw source data.

---

## References

DSTL. 2018. Relationship and entity extraction evaluation dataset. https://github.com/dstl/re3d/. Accessed: 2021-07-01.

Greene, D. and Cunningham, P., 2006, June. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd international conference on Machine learning (pp. 377-384).

Meher, S. and Brandt, P.T., 2025. ConflLlama: Domain-specific adaptation of large language models for conflict event classification. Research & Politics, 12(3), p.20531680251356282.

START (National Consortium for the Study of Terrorism and Responses to Terrorism). 2022. Global Terrorism Database 1970-2020 [data file]. https://www.start.umd.edu/data-tools/GTD


